BMC Bioinformatics
Springer Science and Business Media LLC
Preprints posted in the last 90 days, ranked by how well they match BMC Bioinformatics's content profile, based on 383 papers previously published here. The average preprint has a 0.37% match score for this journal, so anything above that is already an above-average fit.
Bokil, N. V.; Page, D. C.
Background: Visualization of epigenomic data such as coverage tracks, peak calls, and chromatin interactions is a critical task in genomic data analysis. Although genome browsers such as the Integrative Genomics Viewer (IGV) and the UCSC Genome Browser permit user-friendly exploration of genomic tracks, they are not optimized for fully programmatic and reproducible generation of publication-quality figures. In contrast, existing programmatic tools lack a user-friendly interface and require extensive configuration. Results: We present trackDJ (Track Display Jockey), an R package for visualization of epigenomic data. trackDJ prioritizes usability by favoring convention over configuration; it provides high-level plotting functions with sensible defaults, allowing users with minimal programming experience to generate clear, publication-quality figures with relatively little coding. Within a unified plotting framework, users can stack and align multiple data types, including coverage tracks, peak annotations, chromatin loops, and gene annotations. trackDJ allows users to select plotted genomic regions by coordinates or by gene name, enabling rapid visualization without knowledge of precise locus boundaries. Conclusions: trackDJ provides a user-friendly and reproducible alternative to interactive genome browsers for epigenomic visualization, filling a critical gap in currently available epigenomics toolkits. By enabling scripted generation of clean, customizable genomic illustrations, trackDJ integrates naturally into R-based analysis workflows to streamline the creation of publication-quality figures.
Overmann, M.; Grabert, G.; Kacprowski, T.
Background: Gene expression profiling is widely used to investigate disease mechanisms, but classical approaches such as differential expression or pairwise correlation analyses provide limited interpretability. Network-based differential co-expression methods that model conditional dependencies through partial correlations offer richer insights, yet their application in high-dimensional settings requires estimation of precision matrices. Numerous precision matrix estimation methods (PMEMs) have been proposed, but their relative performance under various conditions remains unclear. Results: Simulated gene expression datasets with known ground truth correlation structures were used to benchmark a broad set of PMEMs. Performance was strongly affected by data characteristics, including covariance structure, matrix density, covariance values, sample size-to-dimension ratio, and sampling distribution. Among the evaluated methods, GLassoElnetFast consistently showed the highest accuracy in recovering differential edges, although high signal-to-noise ratios and sufficient sample sizes remain essential for reliable inference. Conclusions: Evaluation across diverse simulation conditions demonstrated that no single metric or condition was sufficient to assess PMEM performance. Therefore, previous less extensive evaluations risked misleading conclusions. Our simulation and benchmarking framework supports future method development and ensures reproducible evaluation of newly developed approaches.
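As a reading aid for the abstract above, the standard relationship between a precision matrix and partial correlations can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the function names and the `threshold` used to flag a differential edge are hypothetical, and the precision matrices are assumed to have already been estimated by some PMEM.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (list of lists) into partial correlations.

    Uses the standard identity rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
    """
    n = len(precision)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                result[i][j] = 1.0
            else:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                result[i][j] = -precision[i][j] / scale
    return result


def differential_edges(precision_a, precision_b, threshold=0.1):
    """Return gene-index pairs whose partial correlation differs between conditions.

    The fixed threshold is an illustrative assumption, not a recommended value.
    """
    rho_a = partial_correlations(precision_a)
    rho_b = partial_correlations(precision_b)
    edges = []
    for i in range(len(rho_a)):
        for j in range(i + 1, len(rho_a)):
            if abs(rho_a[i][j] - rho_b[i][j]) > threshold:
                edges.append((i, j))
    return edges


# Toy example: two genes that are conditionally dependent in condition A only.
condition_a = [[2.0, -1.0], [-1.0, 2.0]]
condition_b = [[2.0, 0.0], [0.0, 2.0]]
print(partial_correlations(condition_a)[0][1])
print(differential_edges(condition_a, condition_b))
```

In practice the difficulty the abstract describes lies in estimating the precision matrices themselves when genes outnumber samples; the conversion shown here is the simple final step.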
Roule, T.; Akizu, N.
Background: Despite their widespread use, quantitative comparison of epigenomic datasets such as ChIP-seq and CUT&RUN remains challenging, particularly due to difficulties in signal normalization across samples and conditions. Normalization based solely on sequencing depth is often insufficient because signal-to-noise ratios vary widely across samples, even within the same experiment. While exogenous spike-in normalization can address some issues, robust spike-in controls are not always available and may introduce additional experimental burden and computational complexity. Furthermore, normalization and differential binding analysis are typically performed using separate bioinformatics tools. Indeed, most differential analysis frameworks operate on raw count matrices, preventing users from visually inspecting normalized signal tracks and evaluating how normalization influences the results. To overcome these challenges, we developed GNOMES (Genome-wide NOrmalization of Mapped Epigenomic Signals), a framework that integrates signal normalization, quality control, and differential binding analysis within a unified workflow. Results: GNOMES is a user-friendly tool that processes ChIP-seq and CUT&RUN datasets from aligned reads and generates normalized coverage profiles and differential binding results. The tool implements a robust genome-wide normalization strategy based on percentile scaling of signal local maxima, enabling stable normalization between biological replicates and conditions. GNOMES supports both single- and paired-end sequencing, does not require a negative control (input or IgG), and can be applied to both broad (histone marks) and narrow (transcription factor) enrichment patterns. The workflow includes normalization, optional consensus peak identification, and differential binding analysis.
For each step, GNOMES generates extensive quality-control metrics and visual outputs, including normalized bigWig tracks, median signal tracks, BED files of regions with significant changes, and diagnostic plots such as heatmaps and PCA. GNOMES is highly configurable and integrates established tools such as MACS2 to identify candidate peak regions for differential binding analysis, as well as DESeq2 and edgeR for statistical testing. Finally, GNOMES is organism-agnostic and can be applied to epigenomic datasets from any model system. Conclusions: GNOMES provides an integrated and highly customizable environment for normalization and differential binding analysis of epigenomic sequencing data. By combining signal normalization, downstream statistical testing for differential binding, and comprehensive quality control, GNOMES simplifies the analysis of ChIP-seq and CUT&RUN datasets for the identification of chromatin changes.
Grether, V.; Goldstein, Z. R.; Shelton, J. M.; Chu, T. R.; Hooper, W. F.; Geiger, H.; Corvelo, A.; Martini, R.; Davis, M. B.; Robine, N.; Liao, W.
Background: Formalin-fixed paraffin-embedding (FFPE) is a widely used, cost-effective method for long-term storage of clinical samples. However, fixation is known to introduce damage to nucleic acids that can present as artifactual bases in sequencing, which are otherwise absent from higher-fidelity storage methods such as fresh freezing (FF). Various machine learning methods exist for filtering these variant artifacts, but benchmarking performance can be difficult without reliable truth sets. In this study, we employ a collection of 90 paired fresh-frozen and formalin-fixed paraffin-embedded samples from the same tumor to robustly define real and FFPE-derived, artifactual variation and enable objective evaluation of filtering methods. To address existing shortcomings, we propose a novel explainable boosting machine (EBM) model that improves performance, can be easily updated with new data, requires modest computational resources, and is analysis pipeline agnostic, making it broadly accessible. Results: We evaluated several methods for limiting FFPE-derived variant artifacts using cohorts of B-cell lymphoma samples. We found capturing local context around variants to be a highly informative, under-utilized feature set not commonly incorporated into many existing machine learning methods. Consequently, we developed a novel algorithm, FIFA, for filtering FFPE artifacts, which uses an EBM model, an interpretable decision-tree-based learning algorithm, to address some of the existing shortcomings. We used four independent cohorts composed of paired lymphoma and cervical cancer samples and a breast cancer cell line with both FF and FFPE samples to define clearly annotated training and test sets and demonstrated improved performance over existing methods. Additionally, FIFA filtering increased relevant biological signals in FFPE breast cancer datasets distinct from the training and testing sets.
The EBM framework employed by FIFA is computationally efficient, and its generalized additive modeling of features makes it straightforward to incorporate new data into existing models over time. Conclusions: Our novel FFPE variant artifact filtering tool, FIFA, is a marked improvement over existing methods. It can be easily implemented, post hoc, to supplement existing somatic calling pipelines; training and inference can be run quickly across most compute environments; and it can be easily updated online as new training data becomes available. Accordingly, FIFA represents an important advance in retrospective cancer genomics research by further enhancing access to the vast stores of FFPE-archived tumor samples currently in existence.
Wang, J.; Robinson, M. D.
Background: Long-read RNA sequencing (lrRNA-seq) enables transcript-resolved variant detection, but systematic and neutral evaluations of small-variant calling pipelines remain limited. The performance of existing tools across sequencing technologies, alignment strategies, variant caller choices, genomic contexts and downstream haplotype phasing is not fully understood. Results: Here, we systematically benchmark four lrRNA-seq variant callers (Clair3-RNA, DeepVariant, longcallR, and longcallR-nn), along with a widely used short-read RNA-seq variant caller (GATK HaplotypeCaller) as a baseline, using Genome in a Bottle (GIAB) datasets comprising three cell lines sequenced with four Oxford Nanopore Technologies (ONT) and two PacBio library preparation protocols. We further evaluate the impact of upstream alignment strategies, including aligner choice and alignment transformation, on variant-calling performance. Accuracy is assessed across sequencing depths and genomic contexts. Additionally, we compare haplotype phasing tools (WhatsHap, LongPhase, HapCUT2, HiPhase and longcallR) using variant calls generated by different callers to identify optimal pipeline combinations. Finally, we extend our evaluation of variant-calling performance to more recent LongBench datasets. Conclusions: Our benchmark shows that sequencing quality is the primary determinant of lrRNA-seq variant-calling performance, followed by variant caller and alignment strategy, with additional effects from genomic context. In GIAB datasets, all lrRNA-seq-specific callers performed reasonably well, with Clair3-RNA (across both ONT and PacBio) and DeepVariant (for PacBio) ranking among the top-performing methods. In more recent LongBench datasets of cancer cell lines, DeepVariant and longcallR showed higher sensitivity, whereas Clair3-RNA and longcallR-nn were more conservative, yielding fewer variant calls.
For downstream haplotype phasing, we recommend WhatsHap or HapCUT2 for most libraries, owing to their high phasing coverage and accuracy, respectively, while longcallR performs better on ONT dRNA004 datasets across both metrics.
Eulenfeld, T.; Collatz, M.; Braun, S. D.; Ehricht, R.
Introduction: Accurate in silico evaluation of primers and probes is essential for the rational design of molecular multi-parameter assays. We present Assay-BLAST v2 to automate and simplify this process for extensive assay designs. Results: A newly integrated strand and proximity check enables precise validation of corresponding oligonucleotides, ensuring correct orientation and spacing for efficient amplification. Based on predicted oligonucleotide interactions, Assay-BLAST v2 estimates amplification outcomes, offering a computational benchmark for downstream wet-lab validation and performance correlation. Additionally, the updated software integrates an adaptive BLAST parameter optimization that dynamically scales with database size, thereby improving both analytical sensitivity and computational performance. These improvements are supported by a comparative evaluation against the previous version of AssayBLAST. Conclusions: Collectively, these enhancements streamline the assay development workflow, reduce costs associated with suboptimal primer and probe synthesis, and increase the robustness and reliability of molecular diagnostics and research applications.
Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.
Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl GeneIDs, Entrez GeneIDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it
Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.
Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analyses of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines. Among these systems, Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we employed Nextflow to re-design an already existing pipeline dedicated to the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, represents a modernized and fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. The data of the original article was used to benchmark the new pipeline. Its outputs closely match those of the original study with minor variations. Conclusions: MOAflow demonstrates how the adoption of robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting the computational demands.
Neubrand, N.; Rachel, T.; Litwin, T.; Timmer, J.; Kreutz, C.; Hess, M.
Motivation: Systems biology strives to unravel the complex dynamics of cellular processes, often with the help of ordinary differential equations (ODEs). However, the sparsity of measured data and the strong non-linearity of common ODEs introduce severe numerical problems in typical modeling tasks. This gave rise to the development of many computational algorithms that must be systematically evaluated to ensure optimal method choices. Currently, the number of well-curated models for such benchmarking efforts is insufficient, as building and calibrating biologically reasonable models based on experiments requires years of work. Results: We present a large-scale collection of 1100 synthetic modeling problems, generated based on the ODE systems and experimental designs of 22 published modeling problems. This is achieved by extending a recent method for simulation of time-course data for randomly generated observation functions to also include realistic measurement patterns across multiple experimental conditions. By analyzing data and model characteristics, optimization performance and parameter identifiability, we show that the synthetic problems provide both a realistic and diverse extension of the existing problem space. Hence, the synthetic collection provides a valuable resource for benchmarking in dynamic modeling. Availability and Implementation: Benchmark problems and algorithm are publicly available at https://github.com/niklasneubrand/1100SyntheticBenchmarksODE and https://zenodo.org/records/14008247.
Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.
Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.
Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging. Findings: To meet this challenge, we developed Distributed Population Genetics Tool (DPGT), an efficient computing framework and a robust tool for joint variant calling on large cohorts based on Apache Spark. DPGT simplifies joint calling tasks for large cohorts with a single command on a local computer or a computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated DPGT against existing methods using 2,504 samples from the 1000 Genomes Project (1KGP), 6 Genome in a Bottle (GIAB) samples, and 9,158 internal whole-genome sequencing (WGS) samples. DPGT produced results comparable in accuracy to existing methods, with less time and better scalability. Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code is available under a GPLv3 license at https://github.com/BGI-flexlab/DPGT, implemented in Java and C++.
Andrews, B.; Ranganathan, R.
Motivation: DNA barcodes are commonly used as a tool to distinguish genuine mutations from sequencing errors in sequencing-based assays. In the presence of indel errors, utilizing barcodes requires accurate alignment of the raw reads to distinguish genuine indels from indel errors. Existing strategies to do this generally rely on aligners built for homology comparison and do not fully utilize quality scores. We reasoned that developing an aligner purpose-built for error correction could yield higher quality barcode-sequence maps. Results: Here, we present BCAR, a fast barcode-sequence mapper for correcting sequencing errors. BCAR considers all of the evidence for each base call at each position both during alignment and during final consensus generation. BCAR creates high-accuracy barcode-sequence maps from simulated reads across a broad range of error rates and read lengths, outperforming existing methods. We apply BCAR to two experimental datasets, where it generates high-quality barcode-sequence maps. Availability and implementation: BCAR source code, documentation and test data are available from: https://github.com/dry-brews/BCAR
Kim, M.; Cui, Y.; Kim, M. G.
Background: The interpretation of high-dimensional transcriptomic data remains a major challenge in mechanistic toxicology and drug safety assessment. Conventional clustering approaches based solely on expression profiles often fail to capture intrinsic biological relationships among genes, limiting interpretability and downstream analysis. Methods: We developed a hierarchy-aware gene exploration platform that integrates structured biological knowledge from the HUGO Gene Nomenclature Committee (HGNC). The core of the framework is a similarity kernel based on a single-step hyperdiffusion formulation ($HKH^{\top}$), which embeds gene family hierarchy into the similarity space. The platform is implemented as an interactive web application supporting Uniform Manifold Approximation and Projection (UMAP) visualization, Leiden clustering, functional enrichment analysis, and hierarchy-based gene recommendation. Results: Applied to a transcriptomic dataset of acetaminophen-induced acute liver failure (APAP-ALF), the proposed approach achieved a 33.8-fold improvement in functional coherence compared to an expression-only baseline. The hierarchy-aware embedding produced compact and biologically consistent clusters, enabling identification of key toxicological modules, including disruption of RNA processing, extracellular matrix remodeling, and impairment of lipid transport. In addition, the framework detected small but highly significant regulatory modules associated with epigenetic reprogramming. Conclusion: By incorporating biological hierarchy into gene similarity, the proposed platform enhances the interpretability of transcriptomic analysis and enables structured exploration of functional relationships. This approach provides a practical framework for mechanistic insight generation and supports more transparent and reproducible analysis in toxicogenomics. Availability: The web application is freely available at https://hgncgeneexplorer.streamlit.app/.
Petrov, P.; Izzi, V.
Motivation: R, together with CRAN and Bioconductor, provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a widely used operating system in this field, R packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R packages with its native package manager. Results: The cran2crux tool presented here automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependency information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX. Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.
Kawato, S.
Motivation: Generating graphical diagrams of microbial and organellar genomes is a common and essential task in bioinformatics. Existing tools often present a trade-off: powerful programming libraries require coding skills, whereas graphical applications require server processing or local installation with complex dependencies. This highlights the need for a tool that offers both programmatic control for batch processing and graphical accessibility for ease of use. Results: To fill this gap, I developed gbdraw, a web application that generates circular and linear genome diagrams from self-contained GenBank or DDBJ files or combinations of GFF3 annotation and FASTA sequence files. Its core functions include visualizing annotated features, plotting GC content/skew tracks, and optionally generating pairwise sequence comparisons for comparative genomics. It is available as both a GUI web application and a command-line utility. Unlike existing web-based tools that require data upload to a remote server, gbdraw operates entirely within the user's web browser. This serverless architecture ensures that sensitive sequence data never leaves the local machine, providing a secure environment for visualizing unpublished genomic data. Availability and Implementation: gbdraw is implemented in Python 3 (version 3.10+) and is freely available under the MIT license. The web app is available at https://gbdraw.app/. Source code and documentation are available at https://github.com/satoshikawato/gbdraw. The local version can be installed from the Bioconda channel using a conda-compatible package manager.
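For readers unfamiliar with the GC content/skew tracks mentioned in the abstract above, the quantities are conventionally computed per window as (G + C) / window length and (G - C) / (G + C). The sketch below illustrates that convention only; it is not taken from gbdraw, and the function name, window size, and step size are hypothetical.

```python
def gc_tracks(sequence, window=4, step=4):
    """Compute (start, GC content, GC skew) for each window of a DNA sequence.

    GC content = (G + C) / window length.
    GC skew    = (G - C) / (G + C), reported as 0.0 when a window has no G or C.
    Trailing bases that do not fill a whole window are ignored.
    """
    seq = sequence.upper()
    tracks = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        content = (g + c) / window
        skew = (g - c) / (g + c) if (g + c) else 0.0
        tracks.append((start, content, skew))
    return tracks


# Toy example with two non-overlapping windows of four bases each.
for start, content, skew in gc_tracks("GGGCATAT"):
    print(start, content, skew)
```

Real genome plots typically use windows of hundreds to thousands of bases; the tiny window here just keeps the arithmetic easy to follow.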
Garcia-Ruano, D.; Georges, M.; Mohanty, S. K.; Baaziz, R.; Makova, K. D.; Nikolski, M.; Chalopin, D.
Background: Long non-coding RNAs (lncRNAs) have gained significant attention in recent years, yet distinguishing them from protein-coding transcripts remains challenging. Indeed, many lncRNAs share mRNA-like processing and existing sequence-derived signals do not fully capture the coding/non-coding boundary. Recent GENCODE annotation efforts revealed tens of thousands of novel lncRNA sequences as well as the reclassification of some lncRNAs into the protein-coding class, highlighting the need to better characterize transcript features associated with classification uncertainty and errors. Results: We performed uncertainty-aware benchmarking by retraining and evaluating eight transcript classifiers under a controlled protocol on a label-stable GENCODE v46-v47 subset. Beyond conventional model evaluation metrics, we quantified inter-tool agreement and entropy-based uncertainty to stratify transcripts into consensus, discordant, and consensus-error groups. To expand standard sequence and ORF-derived signals, we incorporated repeat-derived features from mature transcripts and non-B DNA motif features across gene bodies. Although aggregate performance was high, ~45% of transcripts showed inter-tool discordance, particularly among lncRNAs. Feature analyses linked low-uncertainty predictions to strong coding-like signals, whereas high-uncertainty profiles exhibited mixed signatures. Alongside classical predictors in global importance analyses, repeat-derived features appear as main contributors. Conclusions: By combining controlled benchmarking with transcript-level agreement and uncertainty stratification, together with extended feature profiling, we identified patterns associated with classifier disagreement and misclassification. This novel framework provides practical guidance for interpreting predictions, motivating the development of more robust coding/non-coding classifiers, while also shedding light on the sequence properties that distinguish lncRNA sequences.
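One plausible reading of the entropy-based stratification described in the abstract above is sketched here. It is an assumption-laden illustration, not the authors' protocol: uncertainty is taken to be the Shannon entropy of the class votes cast by the classifiers, and the grouping rule (unanimous and correct, unanimous and wrong, otherwise discordant) as well as the function names are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(predictions):
    """Shannon entropy (in bits) of the class labels predicted by several tools."""
    total = len(predictions)
    entropy = 0.0
    for count in Counter(predictions).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def stratify(predictions, annotated_label):
    """Assign a transcript to a group based on tool agreement (assumed rule).

    consensus:       all tools agree and match the annotation
    consensus-error: all tools agree but contradict the annotation
    discordant:      the tools disagree with one another
    """
    if len(set(predictions)) > 1:
        return "discordant"
    if predictions[0] == annotated_label:
        return "consensus"
    return "consensus-error"


# Toy example with eight classifiers voting on one transcript.
votes = ["coding"] * 4 + ["lncRNA"] * 4
print(vote_entropy(votes), stratify(votes, "lncRNA"))
```

With two classes the entropy ranges from 0 bits (unanimous vote) to 1 bit (an even split), so it provides a graded measure where the three-way grouping gives only a coarse label.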
Qorri, E.; Varga, V.; Priskin, K.; Latinovics, D.; Takacs, B.; Pekker, E.; Jaksa, G.; Csanyi, B.; Torday, L.; Bassam, A.; Kahan, Z.; Pinter, L.; Haracska, L.
Background: Circular RNAs (circRNAs) have emerged as promising non-invasive cancer biomarkers due to their stability, abundance in body fluids, and regulatory potential. However, circRNA differential expression analysis (DEA) remains challenging, largely owing to a lack of consensus on important preprocessing strategies such as filtering and normalization. While well-established bulk RNA-sequencing frameworks are commonly applied to circRNA data, newer approaches such as CIRI-DE (part of the CIRI3 suite) integrate both linear and circular transcript information to improve detection. Despite these developments, an assessment of these integrative strategies is lacking, and the critical impact of filtering on DEA model performance has not been comprehensively evaluated. Results: In this study, we evaluated the impact of multiple normalization and filtering strategies on circRNA DEA using five experimental datasets, including two in-house blood platelet sets and semi-parametric simulated in silico datasets. Our results emphasize the importance of selecting an appropriate filtering threshold, as overly lenient filtering substantially reduced model performance across datasets. We found edgeR's filterByExpr() strategy particularly effective in handling zero counts in circRNA data, while also generating the most reliable results across most datasets. Furthermore, by incorporating linear and circular information as described in CIRI-DE, most methods identified a higher number of differentially expressed (DE) circRNAs compared to circular counts alone. Notably, circRNAs identified by both CIRI-DE and the modified bulk RNA-sequencing pipelines showed substantial overlap. Conclusion: Our findings demonstrate that automated filtering combined with linear-aware normalization significantly enhances the sensitivity and reproducibility of circRNA DEA, providing a standardized framework for more reliable biomarker discovery in transcriptomic research.
Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.
MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA from two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. These approaches sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
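The within-sample scaling favored in the abstract above is simple enough to state in code: each count is divided by the sample's total and multiplied by one million. The sketch below shows that standard RPM definition together with a basic monotonicity check of the kind a mixture series allows. The function names, miRNA names, and counts are hypothetical and are not drawn from the study's dataset.

```python
def reads_per_million(counts):
    """Scale one sample's raw counts to Reads Per Million (RPM).

    `counts` maps miRNA name -> raw read count for a single sample.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


def is_monotonic(values):
    """True if the values never decrease or never increase along the series."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Toy mixture series: three samples with an increasing share of one miRNA.
samples = [
    {"miR-A": 500, "miR-B": 1500},
    {"miR-A": 1000, "miR-B": 1000},
    {"miR-A": 3000, "miR-B": 1000},
]
profile = [reads_per_million(sample)["miR-A"] for sample in samples]
print(profile, is_monotonic(profile))
```

Because RPM uses only the sample's own total, it makes no assumption that most features are unchanged between samples, which is one common explanation for why cross-sample methods can over-correct when that assumption fails.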
Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.
Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean area under the receiver operating characteristic curve (ROC-AUC) of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem.
Author Summary: Protein-protein interactions are fundamental in all biological processes. Predicting these interactions remains a key problem in molecular biology, and computational approaches have been tested to address it. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned representations of the primary sequence and related characteristics, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we systematically compared frequency-only features with hybrid features across four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.
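The hybrid feature idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the latent vectors would come from their convolutional autoencoder, which is not reproduced here, so they are passed in as plain lists. The function names, the concatenation order, and the handling of non-standard residues are all hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_frequencies(sequence):
    """Return the relative frequency of each standard amino acid.

    Assumption: non-standard symbols (e.g. X, B, Z) are ignored rather
    than counted, and an empty sequence yields an all-zero vector.
    """
    sequence = sequence.upper()
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


def hybrid_features(seq_a, seq_b, latent_a, latent_b):
    """Concatenate frequency features and latent features for a protein pair.

    `latent_a` and `latent_b` stand in for autoencoder outputs; their
    length and the concatenation order are assumptions for illustration.
    """
    return (aa_frequencies(seq_a) + aa_frequencies(seq_b)
            + list(latent_a) + list(latent_b))


# Toy sequences and made-up latent vectors, for illustration only.
features = hybrid_features("ACDA", "WYW", [0.1, 0.2], [0.3, 0.4])
print(len(features))
```

A vector built this way could then be passed to any of the classifiers being compared, with the frequency-only baseline obtained by omitting the latent part.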
Bercovich Szulmajster, U.; Wiuf, C.; Albrechtsen, A.
Linkage disequilibrium is a central statistic in population genetic studies, commonly measured by the squared correlation between pairs of genetic variants. An important drawback of this measure is its upward bias caused by a finite sample size. Several methods exist to correct for this sample-size bias. However, because the correlation consists of a ratio, there is no unbiased method to compute it. In this work, we present a procedure to calibrate those methods using a non-parametric approach with simulated data. This is done with forward modeling to generate genotype matrices with known parameters, followed by an inverse mapping to recover estimates of the underlying parameters. Then, a mean-centering calibration is applied to the recovered estimate of the true parameter. This approach is applied to real and simulated data, showing consistent improvement in accuracy compared to other sample-size-aware methods. Furthermore, to study the effects on downstream analyses, we analyze classification performance in LD pruning, where we also observe an improvement, particularly in extreme cases with low sample sizes of 5 or 10 individuals.
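To make the quantities in this abstract concrete, the sketch below computes the squared correlation between two genotype vectors and applies a simple mean-centering step using simulated estimates with a known true value. It is an assumption-laden illustration: the abstract does not specify how the simulations are generated or how the calibration is parameterized, so both function names and the form of the correction are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors.

    Genotypes are assumed to be coded as allele counts (0, 1, 2).
    Returns NaN when either variant is monomorphic in the sample.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy * sxy / (sxx * syy)


def mean_center_calibrate(estimate, simulated_estimates, true_value):
    """Subtract the average bias observed in simulations (hypothetical form).

    `simulated_estimates` are estimates obtained from data simulated with
    a known `true_value`; their mean offset is treated as the bias.
    """
    bias = mean(simulated_estimates) - true_value
    return estimate - bias


# Toy genotypes for five individuals, for illustration only.
print(r_squared([0, 1, 2, 1, 0], [0, 1, 2, 2, 0]))
```

With only 5 or 10 individuals, the raw estimate from `r_squared` varies widely between samples, which is the regime where the abstract reports the largest benefit from calibration.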